Towards Scalable Data-Driven Authorship Attribution
Authors
Abstract
Traditional authorship attribution approaches have relied on heuristically designed features: researchers guessed which aspects of language would best separate one author from another and then ran experiments to test those assumptions. While this approach has met with some success, it has not proved scalable: most test collections to date have contained 10 or fewer authors, an unrealistically small number in the age of internet-style publication. We believe that this approach to feature selection adds unnecessary complexity to what the task really seems to be: a multiclass classification problem, and one where the most useful features can be easily discovered using a standard dimensionality reduction technique. We demonstrate the use of such a technique to dramatically reduce the number of features used for authorship attribution with an implementation of Support Vector Machines.
Similar resources
Towards a better understanding of Burrows's Delta in literary authorship attribution
Burrows’s Delta is the most established measure for stylometric difference in literary authorship attribution. Several improvements on the original Delta have been proposed. However, a recent empirical study showed that none of the proposed variants constitute a major improvement in terms of authorship attribution performance. With this paper, we try to improve our understanding of how and why ...
PREPRINT VERSION An agent-driven semantical identifier using radial basis neural networks and reinforcement learning
Due to the huge availability of documents in digital form, and the deception possibility raise bound to the essence of digital documents and the way they are spread, the authorship attribution problem has constantly increased its relevance. Nowadays, authorship attribution, for both information retrieval and analysis, has gained great importance in the context of security, trust and copyright p...
Effective and Scalable Authorship Attribution Using Function Words
Techniques for identifying the author of an unattributed document can be applied to problems in information analysis and in academic scholarship. A range of methods have been proposed in the research literature, using a variety of features and machine learning approaches, but the methods have been tested on very different data and the results cannot be compared. It is not even clear whether the...
An Agent-driven Semantical Identifier Using Radial Basis Neural Networks and Reinforcement Learning
Due to the huge availability of documents in digital form, and the deception possibility raise bound to the essence of digital documents and the way they are spread, the authorship attribution problem has constantly increased its relevance. Nowadays, authorship attribution, for both information retrieval and analysis, has gained great importance in the context of security, trust and copyright p...
Domain Independent Authorship Attribution without Domain Adaptation
Automatic authorship attribution, by its nature, is much more advantageous if it is domain (i.e., topic and/or genre) independent. That is, many real world problems that require authorship attribution may not have in-domain training data readily available. However, most previous work based on machine learning techniques focused only on in-domain text for authorship attribution. In this paper, w...